Chapter 4: Clustering and classification

#Get access to the libraries
library(MASS)
library(ggplot2)
library(GGally)
library(tidyr)
#install.packages("plotly")
library(tidyverse)
library(corrplot)
library(dplyr)
library(plotly)
data("Boston")
colnames(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
#structure of the data
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
#dimension of the data
dim(Boston)
## [1] 506  14
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...

The data includes information about housing values in suburbs of Boston. The data has 14 variables and 506 observations. The variables in the dataset are described below:

|Variable|Definition|
|---|---|
|crim|per capita crime rate by town|
|zn|proportion of residential land zoned for lots over 25,000 sq.ft.|
|indus|proportion of non-retail business acres per town|
|chas|Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)|
|nox|nitrogen oxides concentration (parts per 10 million)|
|rm|average number of rooms per dwelling|
|age|proportion of owner-occupied units built prior to 1940|
|dis|weighted mean of distances to five Boston employment centres|
|rad|index of accessibility to radial highways|
|tax|full-value property-tax rate per $10,000|
|ptratio|pupil-teacher ratio by town|
|black|1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town|
|lstat|lower status of the population (percent)|
|medv|median value of owner-occupied homes in $1000s|

Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-2 points)

3. Graphical overview of the data

  • Below are the graphical representations showing the distributions of and correlations between the variables.
#graphical overview of the data
pairs(Boston)
#ggpairs(Boston) would give a more detailed overview but is slow to render

#summary of the data
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston)
cor_matrix
##                crim          zn       indus         chas         nox
## crim     1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171
## zn      -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus    0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145
## chas    -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281
## nox      0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000
## rm      -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819
## age      0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010
## dis     -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad      0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056
## tax      0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320
## ptratio  0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268
## black   -0.38506394  0.17552032 -0.35697654  0.048788485 -0.38005064
## lstat    0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892
## medv    -0.38830461  0.36044534 -0.48372516  0.175260177 -0.42732077
##                  rm         age         dis          rad         tax
## crim    -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431
## zn       0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332
## indus   -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018
## chas     0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652
## nox     -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320
## rm       1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783
## age     -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559
## dis      0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158
## rad     -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819
## tax     -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000
## ptratio -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304
## black    0.12806864 -0.27353398  0.29151167 -0.444412816 -0.44180801
## lstat   -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341
## medv     0.69535995 -0.37695457  0.24992873 -0.381626231 -0.46853593
##            ptratio       black      lstat       medv
## crim     0.2899456 -0.38506394  0.4556215 -0.3883046
## zn      -0.3916785  0.17552032 -0.4129946  0.3604453
## indus    0.3832476 -0.35697654  0.6037997 -0.4837252
## chas    -0.1215152  0.04878848 -0.0539293  0.1752602
## nox      0.1889327 -0.38005064  0.5908789 -0.4273208
## rm      -0.3555015  0.12806864 -0.6138083  0.6953599
## age      0.2615150 -0.27353398  0.6023385 -0.3769546
## dis     -0.2324705  0.29151167 -0.4969958  0.2499287
## rad      0.4647412 -0.44441282  0.4886763 -0.3816262
## tax      0.4608530 -0.44180801  0.5439934 -0.4685359
## ptratio  1.0000000 -0.17738330  0.3740443 -0.5077867
## black   -0.1773833  1.00000000 -0.3660869  0.3334608
## lstat    0.3740443 -0.36608690  1.0000000 -0.7376627
## medv    -0.5077867  0.33346082 -0.7376627  1.0000000
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper")

From the plot above, we can see both positive and negative correlations. For instance, industrial land use (indus) and nitrogen oxide concentration (nox) are positively correlated, as are indus and tax. Crime (crim) is positively correlated with the index of accessibility to radial highways (rad). "age" and "dis" are strongly negatively correlated, and "dis" and "tax" are moderately negatively correlated as well. The correlation plot reveals further weak to strong positive and negative correlations between the variables.

4. Standardising the dataset

Linear discriminant analysis will be performed later, so the entire dataset needs to be scaled. This is done by subtracting the column mean from each column and dividing the difference by the column's standard deviation.
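As a sanity check, the operation scale() performs can be reproduced by hand on a small vector (a minimal sketch; scale() applies the same formula column-wise to the whole data frame):

```r
# standardise a toy vector manually and compare with scale()
x <- c(2, 4, 6, 8)
z <- (x - mean(x)) / sd(x)           # subtract the mean, divide by the sd
all.equal(as.numeric(scale(x)), z)   # TRUE: scale() does exactly this
```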

  • center and standardize variables
boston_scaled <- scale(Boston)
  • summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
  • class of the boston_scaled object
class(boston_scaled)
## [1] "matrix"
  • change the object to data frame
boston_scaled<-as.data.frame(boston_scaled)
  • summary of the scaled crime rate
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
  • create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
  • create a categorical variable ‘crime’
  • Here, the per capita crime rate (crim) is cut at its quantiles, converting it from a continuous to a categorical variable. It is made into a factor to obtain the different levels of crime.
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label=c("low", "med_low", "med_high", "high"))
  • look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127

To avoid confusion, I will delete the old continuous variable “crim” in favour of the newly created categorical variable “crime”.

  • remove original crim from the dataset

boston_scaled <- dplyr::select(boston_scaled, -crim)
  • add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
  • Sampling: the data is split into training and testing sets to assess the predictive performance of the model. The model is fitted on the train data and evaluated on the test data.
#number of rows in the Boston dataset 
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

5. Linear discriminant analysis

Here crime is the target variable and all other variables are the predictors. Linear discriminant analysis (LDA), a generalisation of Fisher's linear discriminant, is a method for finding a linear combination of features that separates or characterises two or more classes of objects or events. It is used in pattern recognition and in machine and statistical learning.

Put differently, it is a classification (and dimension reduction) method which finds the (linear) combination of the variables that separate the target variable classes. The target can be binary or multiclass variable.

Linear discriminant analysis is akin to many other methods, such as principal component analysis. LDA can be visualized with a biplot.

lda.fit <- lda(crime~., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2500000 0.2425743 0.2549505 0.2524752 
## 
## Group means:
##                   zn      indus        chas        nox          rm
## low       0.94893162 -0.8778878 -0.11640431 -0.8542126  0.47491361
## med_low  -0.05715626 -0.4071372 -0.03128211 -0.6109729 -0.12061566
## med_high -0.39482539  0.1662738  0.22458650  0.3498340  0.08058164
## high     -0.48724019  1.0171096 -0.07933396  1.0522509 -0.46340340
##                 age        dis        rad        tax    ptratio      black
## low      -0.8906556  0.8041307 -0.6839520 -0.7443760 -0.4486834  0.3859427
## med_low  -0.3892717  0.4504521 -0.5412349 -0.5184010 -0.0996504  0.3132164
## med_high  0.4011906 -0.3853687 -0.4332831 -0.3159931 -0.2140010  0.1035878
## high      0.7966261 -0.8573967  1.6382099  1.5141140  0.7808718 -0.7665123
##                lstat        medv
## low      -0.78189472  0.55935231
## med_low  -0.15002371  0.02061982
## med_high -0.01261066  0.17567893
## high      0.92240303 -0.75582854
## 
## Coefficients of linear discriminants:
##                   LD1         LD2         LD3
## zn       0.0800863198  0.67740838 -1.13182424
## indus   -0.0008366672 -0.08657272  0.09638975
## chas    -0.1047118453 -0.05848430  0.05335957
## nox      0.3913332525 -0.78573755 -1.18834862
## rm      -0.1608920918 -0.06377917 -0.19125303
## age      0.1625599193 -0.33230150 -0.05076391
## dis     -0.0937611947 -0.16465472  0.47200752
## rad      3.3264105206  1.15602402 -0.05813802
## tax      0.1106209972 -0.24964841  0.63697319
## ptratio  0.0999909991 -0.04454611 -0.39361959
## black   -0.1249690882  0.05820168  0.10194010
## lstat    0.2357282115 -0.20311481  0.36442187
## medv     0.1890251240 -0.40611176 -0.17431270
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9534 0.0345 0.0121

LDA uses the trained model to calculate the probability of each observation belonging to each class, and then assigns the observation to the most likely (probable) class.
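To illustrate this (a toy sketch on R's built-in iris data, not the Boston model): predict() on an lda fit returns both the most probable class and the per-class posterior probabilities.

```r
library(MASS)

# fit LDA on iris and inspect predictions for the first three observations
fit_toy <- lda(Species ~ ., data = iris)
pred_toy <- predict(fit_toy, newdata = iris[1:3, ])
pred_toy$class                  # most probable class of each observation
round(pred_toy$posterior, 3)    # class probabilities; each row sums to 1
```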

To visualise the result, the biplot-arrow function from the DataCamp exercise will be used; it was originally adapted from a Stack Overflow answer.

  • the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
  • target classes as numeric
train$crime <- as.numeric(train$crime)
  • plot the lda results
plot(lda.fit, dimen = 2, col = train$crime, pch= train$crime)
lda.arrows(lda.fit, myscale = 2)

The above arrows depict the relationship between the original variables and the LDA solution.

6. The predictive performance of the LDA

  • save the correct classes from test data
correct_classes <- test[,"crime"]
  • remove the categorical crime variable from test data
test <- dplyr::select(test, -crime)
  • predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
  • cross tabulate the results
table(correct = correct_classes , predicted = lda.pred$class )
##           predicted
## correct    low med_low med_high high
##   low       14      12        0    0
##   med_low    3      14       11    0
##   med_high   0       3       18    2
##   high       0       0        0   25

The predictive performance is good for the high class but considerably weaker for the other classes; the med_low class was misclassified most often.
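The overall accuracy can be read off the cross-tabulation as the share of observations on the diagonal (a sketch using the counts printed above; the exact numbers vary with the random train/test split):

```r
# confusion matrix from the cross-tabulation above (rows = correct, cols = predicted)
cm <- matrix(c(14, 12,  0,  0,
                3, 14, 11,  0,
                0,  3, 18,  2,
                0,  0,  0, 25), nrow = 4, byrow = TRUE)
sum(diag(cm)) / sum(cm)  # proportion classified correctly, about 0.70
```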

7. Clustering

  • reload the Boston dataset
# load MASS and Boston
library(MASS)
data('Boston')
  • standardise the data to get comparable distances later
boston_standard <- scale(Boston)
  • For calculating the distances between observations, I'll use the Euclidean distance. Other methods exist too; by default the dist() function in R uses the Euclidean distance, so specifying it is not strictly necessary, but doing so adds clarity.
  • euclidean distance matrix
dist_eu <-dist(boston_standard, method= "euclidean")
  • look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
  • Now, I'll try out the Manhattan distance too
  • Manhattan distance matrix
dist_man <- dist(boston_standard, method="manhattan")
  • look at the summary of the distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618
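The two metrics can be verified by hand on a pair of points (a sketch): the Euclidean distance is the square root of the summed squared differences, while the Manhattan distance sums the absolute differences.

```r
# distances between the points (0, 0) and (3, 4)
m <- rbind(c(0, 0), c(3, 4))
as.numeric(dist(m, method = "euclidean"))  # sqrt(3^2 + 4^2) = 5
as.numeric(dist(m, method = "manhattan"))  # |3| + |4| = 7
```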

k-means clustering

K-means is a popular clustering method. It is an unsupervised method that assigns observations to clusters based on the similarity of the objects. The distance matrix is computed automatically by the kmeans() function.

  • run the k-means algorithm on the dataset

km <- kmeans(boston_standard, centers = 3)
  • plot the Boston dataset with clusters
pairs(boston_standard[, 6:8], col = km$cluster)

  • determine the number of optimal number of clusters
set.seed(123)
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_standard, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
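For reference, tot.withinss is the sum of squared distances from each observation to its assigned cluster centre, which can be verified directly on a toy example (a sketch):

```r
set.seed(1)
toy <- matrix(rnorm(20), ncol = 2)
km_toy <- kmeans(toy, centers = 2)

# sum of squared distances to the assigned centres matches tot.withinss
manual <- sum((toy - km_toy$centers[km_toy$cluster, ])^2)
all.equal(manual, km_toy$tot.withinss)  # TRUE
```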

From the plot, the total within-cluster sum of squares drops sharply at around 2 clusters, so the number of centers will be set to 2.

  • k-means clustering

km <- kmeans(boston_standard, centers = 2)
  • plot the Boston dataset with clusters
pairs(boston_standard, col = km$cluster)

From the above, it is hard to distinguish the variables, so I'll select a few of them for plotting.

pairs(boston_standard[, 4:7], col = km$cluster)

Bonus: Perform k-means on the original Boston data with some reasonable number of clusters (> 2). Remember to standardize the dataset. Then perform LDA using the clusters as target classes. Include all the variables in the Boston data in the LDA model. Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution). Interpret the results. Which variables are the most influencial linear separators for the clusters? (0-2 points to compensate any loss of points from the above exercises)

boston_standard2<-scale(Boston)
set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_standard2, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

# k-means clustering
km <-kmeans(boston_standard2, centers = 9)

#convert km to dataframe
boston_standard2<-as.data.frame(boston_standard2)

lda.fit_clus<- lda(km$cluster~., data=boston_standard2)

# plot the Boston dataset with clusters
pairs(boston_standard2[,3:7], col = km$cluster)

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# cluster assignments as colours
classes <- km$cluster

# plot the lda results
plot(lda.fit_clus, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit_clus, myscale = 2)

Super-Bonus: Run the code below for the (scaled) train data that you used to fit the LDA. The code creates a matrix product, which is a projection of the data points.

model_predictors <- dplyr::select(train, -crime)


# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

#Next, install and access the plotly package. Create a 3D plot (Cool!) of the columns of the matrix product by typing the code below.
library(plotly)
  • using the plotly package for a 3D plot of the matrix product's columns.
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')

Adjust the code: add argument color as a argument in the plot_ly() function. Set the color to be the crime classes of the train set. Draw another 3D plot where the color is defined by the clusters of the k-means. How do the plots differ? Are there any similarities? (0-3 points to compensate any loss of points from the above exercises)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', color = train$crime)

I got a better visualisation by using the crime classes as the colour representation.

# fit k-means on the same train predictors so the cluster labels line up with the projected points
km_train <- kmeans(model_predictors, centers = 4)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', color = km_train$cluster)